204 research outputs found

    Inference of Markovian Properties of Molecular Sequences from NGS Data and Applications to Comparative Genomics

    Next Generation Sequencing (NGS) technologies generate large amounts of short read data for many different organisms. Because NGS reads are generally short, assembling them and reconstructing the original genome sequence is challenging. For clustering genomes using such NGS data, word-count based alignment-free sequence comparison is a promising approach, but it requires the underlying expected word counts. A plausible model for this underlying distribution of word counts is obtained by modelling the DNA sequence as a Markov chain (MC). For single long sequences, efficient statistics are available to estimate the order of the MC and its transition probability matrix. As NGS data do not provide a single long sequence, inference methods for Markovian properties that rely on single long sequences cannot be directly used for NGS short read data. Here we derive a normal approximation for such word counts. We also show that the traditional Chi-square statistic has an approximate gamma distribution, using the Lander-Waterman model for physical mapping. We propose several methods to estimate the order of the MC based on NGS reads and evaluate them using simulations. We illustrate the applications of our results by clustering genomic sequences of several vertebrate and tree species based on NGS reads using alignment-free sequence dissimilarity measures. We find that the estimated order of the MC has a considerable effect on the clustering results, and that the clustering results that use an MC of the estimated order give a plausible clustering of the species. Comment: accepted by RECOMB-SEQ 201
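
To make the word-count idea in this abstract concrete, the following is a minimal sketch, assuming a plain pooled-count estimate: overlapping words are counted within each read (never across read boundaries) and the counts are normalized per context to give transition probabilities for a Markov chain of a chosen order. It is not the paper's normal approximation or any of its order estimators; the function names and the toy reads are hypothetical.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def count_words(reads, k):
    """Count overlapping k-words inside each read; no word spans two reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            word = read[i:i + k]
            # Skip words containing ambiguous bases such as N.
            if set(word) <= set(ALPHABET):
                counts[word] += 1
    return counts


def transition_probabilities(reads, order):
    """Estimate P(next base | previous `order` bases) from pooled read counts."""
    counts = count_words(reads, order + 1)
    probs = {}
    for context in map("".join, product(ALPHABET, repeat=order)):
        total = sum(counts[context + base] for base in ALPHABET)
        if total:  # contexts never observed are left out
            probs[context] = {
                base: counts[context + base] / total for base in ALPHABET
            }
    return probs


# Toy reads for illustration only (hypothetical data, not from the paper).
reads = ["ACGTAC", "GTACGT", "TACGTA"]
probs = transition_probabilities(reads, order=1)
print(probs["A"])
```

Counting within reads only is the simplest way to respect the fact that read order along the genome is unknown; how read overlap and coverage affect the statistics is what the abstract's approximations address, and this sketch ignores it.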

    Comparison of metagenomic samples using sequence signatures

    BACKGROUND: Sequence signatures, as defined by the frequencies of k-tuples (or k-mers, k-grams), have been used extensively to compare genomic sequences of individual organisms, to identify cis-regulatory modules, and to study the evolution of regulatory sequences. Recently many next-generation sequencing (NGS) read data sets of metagenomic samples from a variety of different environments have been generated. The assembly of these reads can be difficult, and analysis methods based on mapping reads to genes or pathways are also restricted by the availability and completeness of existing databases. Sequence-signature-based methods, however, do not need the complete genomes or existing databases and thus can potentially be very useful for the comparison of metagenomic samples using NGS read data. Still, the applications of sequence signature methods for the comparison of metagenomic samples have not been well studied. RESULTS: We studied several dissimilarity measures for the comparison of metagenomic samples using sequence signatures: $d_2$, $d_2^*$ and $d_2^S$, recently developed by our group; a measure (hereinafter denoted Hao) used in CVTree, developed by Hao’s group (Qi et al., 2004); measures based on relative di-, tri-, and tetra-nucleotide frequencies as in Willner et al. (2009); and standard $l_p$ measures between the frequency vectors. We compared their performance using a series of extensive simulations and three real NGS metagenomic datasets: 39 fecal samples from 33 mammalian host species, 56 marine samples across the world, and 13 fecal samples from human individuals. Results showed that the dissimilarity measure $d_2^S$ can achieve superior performance when comparing metagenomic samples, both in clustering them into different groups and in recovering environmental gradients affecting microbial samples. New insights into the environmental factors affecting microbial compositions in metagenomic samples are obtained through the analyses. Our results show that sequence signatures of the mammalian gut are closely associated with diet and gut physiology of the mammals, and that sequence signatures of marine communities are closely related to location and temperature. CONCLUSIONS: Sequence signatures can successfully reveal major group and gradient relationships among metagenomic samples from NGS reads without alignment to reference databases. The $d_2^S$ dissimilarity measure is a good choice in all application scenarios. The optimal choice of tuple size depends on sequencing depth, but it is quite robust within a range of choices for moderate sequencing depths.
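
As an illustration of comparing samples by sequence signatures, here is a minimal sketch that builds k-tuple count vectors from reads and computes an uncentered, cosine-style dissimilarity based on the inner product of the two count vectors. The exact normalization shown is an assumption about the form of the simplest measure in this family; the centered variants discussed in the abstract additionally subtract expected counts under a background model and are not implemented here. The function names and the toy reads are hypothetical.

```python
import math
from collections import Counter


def kmer_signature(reads, k):
    """Pool k-tuple counts over all reads of one sample."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def d2_dissimilarity(x, y):
    """Half of (1 - cosine similarity) between two k-tuple count vectors.

    Assumed form: 0 for proportional count vectors, 0.5 for samples that
    share no k-tuple at all.
    """
    dot = sum(x[w] * y[w] for w in x.keys() & y.keys())
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("each sample needs at least one k-tuple")
    return 0.5 * (1 - dot / (norm_x * norm_y))


# Toy samples for illustration only (hypothetical data).
sample_a = kmer_signature(["ACGTACGT", "CGTACG"], k=3)
sample_b = kmer_signature(["ACGTTCGT", "TTTACG"], k=3)
print(d2_dissimilarity(sample_a, sample_b))
```

A pairwise matrix of such values over all samples can then be passed to any standard clustering routine, which is the kind of downstream analysis the abstract describes.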

    Responses of soil nitrogen mineralization to temperature and moisture in alpine ecosystems on the Tibetan Plateau

    The responses of soil net nitrogen (N) mineralization to temperature and moisture were investigated in four alpine ecosystems (forest, shrub, meadow and steppe) on the Tibetan Plateau by laboratory incubation of undisturbed soil cores. The results indicated that soil net N mineralization varies greatly between alpine ecosystems. As temperature rose from 5°C to 35°C, the soil net N mineralization rate of the forest ecosystem rose markedly at the three incubation moisture levels and that of the meadow ecosystem rose gently, while that of the shrub and steppe ecosystems increased from 5°C to 25°C and declined from 25°C to 35°C. At the same incubation temperature, the soil net N mineralization of the four alpine ecosystems increased at the middle moisture level and decreased at the low or high moisture levels.

    The impact of atmospheric N deposition and N fertilizer type on soil nitric oxide and nitrous oxide fluxes from agricultural and forest Eutric Regosols

    Agricultural and forest soils with low organic C content and high alkalinity were studied over 17 days to investigate the potential response of the atmospheric pollutant nitric oxide (NO) and the greenhouse gas nitrous oxide (N2O) to (1) increased N deposition rates to forest soil; (2) different fertilizer types applied to agricultural soil; and (3) a simulated rain event on forest and agricultural soils. Cumulative forest soil NO emissions (148–350 ng NO-N g−1) were ~ 4 times larger than N2O emissions (37–69 ng N2O-N g−1). In contrast, agricultural soil NO emissions (21–376 ng NO-N g−1) were ~ 16 times smaller than N2O emissions (45–8491 ng N2O-N g−1). Increasing N deposition rates 10 fold to 30 kg N ha−1 yr−1 doubled soil NO emissions and NO3− concentrations. As such high N deposition rates are not atypical in China, more attention should be paid to forest soil NO research. Comparing the fertilizers urea, ammonium nitrate, and urea coated with the urease inhibitor ‘Agrotain®’ demonstrated that the inhibitor significantly reduced NO and N2O emissions. This is an unintended and not well-known benefit, because the primary function of Agrotain® is to reduce emissions of the atmospheric pollutant ammonia. Simulating a climate change event, a large rainfall after drought, increased soil NO and N2O emissions from both agricultural and forest soils. Such pulses of emissions can contribute significantly to annual NO and N2O emissions, but currently do not receive adequate attention amongst the measurement and modeling communities.

    Liao ning virus in China

    BACKGROUND: Liao ning virus (LNV) is in the genus Seadornavirus within the family Reoviridae and has a genome composed of 12 segments of double-stranded RNA (dsRNA). It is transmitted by mosquitoes and has been isolated only in China to date, and it is the only species within the genus Seadornavirus reported to have been propagated in mammalian cell lines. In this study, we report 41 new isolates from northern and southern Xinjiang Uygur autonomous region in China and describe the phylogenetic relationships among all 46 Chinese LNV isolates. FINDINGS: Phylogenetic analysis based on the partial nucleotide sequence of the 10th segment, which is predicted to encode outer coat proteins of LNV, indicated that all the isolates evaluated in this study can be divided into 3 different groups that appear to be related to geographic origin. Bayesian coalescent analysis estimated the date of the most recent common ancestor for the current Chinese LNV isolates to be 318 (with a 95% confidence interval of 30-719), and the estimated evolutionary rate is 1.993 × 10⁻³ substitutions per site per year. CONCLUSIONS: The results indicated that LNV may be an emerging virus that is evolving rapidly and has been widely distributed in the northern part of China.